Re-estimation of Lexical Parameters for Treebank PCFGs
Abstract
We present procedures which pool lexical information estimated from unlabeled data via the Inside-Outside algorithm, with lexical information from a treebank PCFG. The procedures produce substantial improvements (up to 31.6% error reduction) on the task of determining subcategorization frames of novel verbs, relative to a smoothed Penn Treebank-trained PCFG. Even with relatively small quantities of unlabeled training data, the re-estimated models show promising improvements in labeled bracketing f-scores on Wall Street Journal parsing, and substantial benefit in acquiring the subcategorization preferences of low-frequency verbs.
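The pooling idea can be illustrated with a minimal sketch. It assumes the simplest possible scheme: expected lexical counts from the unlabeled corpus (as Inside-Outside would produce) are scaled by a weight, added to the treebank counts, and renormalized per preterminal tag. The function name, the `weight` parameter, and the tag and word names are hypothetical and are not taken from the paper, whose actual procedures may differ.

```python
from collections import defaultdict


def pool_lexical_counts(treebank_counts, expected_counts, weight=1.0):
    """Pool (tag, word) counts from two sources into P(word | tag).

    treebank_counts: observed counts from the treebank.
    expected_counts: expected counts from unlabeled data (e.g. Inside-Outside).
    weight: hypothetical scaling factor for the unlabeled-data counts.
    """
    pooled = defaultdict(float)
    for key, count in treebank_counts.items():
        pooled[key] += count
    for key, count in expected_counts.items():
        pooled[key] += weight * count

    # Sum the pooled mass per tag so each tag's distribution sums to 1.
    totals = defaultdict(float)
    for (tag, _word), count in pooled.items():
        totals[tag] += count

    return {
        (tag, word): count / totals[tag]
        for (tag, word), count in pooled.items()
        if totals[tag] > 0
    }


# Hypothetical example: "VB.np" stands for a verb tag marked with a
# transitive subcategorization frame.
treebank = {("VB.np", "devour"): 2.0, ("VB.np", "eat"): 2.0}
expected = {("VB.np", "devour"): 4.0}
probs = pool_lexical_counts(treebank, expected, weight=0.5)
```

With these toy counts, the unlabeled data shifts probability mass toward "devour" within the tag, while non-lexical rule probabilities would be left untouched.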
Similar resources
Corpus Induction of Lexicons for Treebank PCFGs by Inside-Outside Estimation and Frequency Transformations
We describe procedures which pool lexical information from a treebank with frequency information estimated from an unannotated corpus with the inside-outside algorithm. PCFG parameters for non-lexical productions are obtained purely from the treebank. The procedures produce substantial improvements (up to 20.34%) on the task of determining valences of tokens of novel verbs, relative to a smoothed...
Induction of Treebank-Aligned Lexical Resources
By ‘treebank-aligned lexical resources’ we mean ones where there is a systematic correspondence between the lexical resource and treebank syntactic resources. For instance, the lexicon resource contains features representing the subcategorization frames of verbs, which correspond to structural configurations that the verb occurs in, in a treebank. Given such an alignment, a treebank can be comp...
Three-Dimensional Parametrization for Parsing Morphologically Rich Languages
Current parameters of accurate unlexicalized parsers based on Probabilistic Context-Free Grammars (PCFGs) form a two-dimensional grid in which rewrite events are conditioned on both horizontal (head-outward) and vertical (parental) histories. In Semitic languages, where arguments may move around rather freely and phrase structures are often shallow, there are additional morphological factors that g...
Stochastic Analysis of Lexical and Semantic Enhanced Structural Language Model
In this paper, we present a directed Markov random field model that integrates trigram models, structural language models (SLM) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. The SLM is essentially a generalization of shift-reduce probabilistic push-down automata and is thus more complex and powerful than probabilistic context-free grammars (PCFGs)....
Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. We empirically analyze the diversity between two representative corpora, i.e. Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different and highly compatible. The analysis is f...